Abstract

This study explores how the language of Disney pictures has changed over time through the computational lenses of topic modelling and sentiment analysis. An initial round of clustering offers insight into prevalent topics and shows how movies group together thematically. Further analyses within clusters then shed light on differences between male and female leads, on how tale-related language evolves over time, and on how these motion pictures deal with themes such as adventure, the future and technology.


Introduction

The reason behind our choice of subject is a deep fascination with the imaginary worlds crafted by Disney and with the language used in the company’s motion pictures. Since these movies span almost a hundred years, ranging from classic to contemporary, they were also a perfect opportunity to study children-centric language from a diachronic point of view. The main challenge consisted in normalizing and treating such orally-bound, children-specific language in a way that would allow us to gather meaningful insight. Since the object of our study was a collection of oral texts extracted from movies, we quickly identified challenges such as the spoken nature of these texts and the widespread use of narrative expedients, such as flashbacks, that complicate computational processing. When we took up the project we chose to include every motion picture released by Walt Disney Animation Studios up until that moment: 59 movies, released between 1937 and 2021, were therefore included. After gathering and cleaning the textual data we applied distant reading techniques to the dataset, aimed at finding recurrent patterns and schemes in poorly structured data. After an initial analysis and a consultation with our supervising professor, we decided to carry out a double analysis: topic modelling allowed us to group documents - our subtitle files - based on which subjects occurred in the texts, while sentiment analysis gave us insight into the development of relatively positive or negative sentences throughout each movie.

The tools we employed were the following:


MALLET

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.1

This package was the starting point of our analysis, as it allowed us - interested humanists - to rise to the challenge of complex computational analyses without the steep learning curve of more complex tools. Its CLI (command line interface, i.e. a toolkit used via terminal commands) struck the perfect balance between control over the analysis and its output, and complex computations that would otherwise require programming knowledge in Java.

Additionally, the toolkit is Open Source software, released under the Apache 2.0 License, and widely used in the field, which also means a large number of resources and solutions to common problems are available online.


Syuzhet

Syuzhet is one of the two terms describing narrative composition, the other being the fabula, theorized by the Russian Formalists Victor Shklovsky and Vladimir Propp. It refers to the “device” or technique of a narrative and concerns the manner in which the components of a story are organized.

This is also the name of an R package specifically targeted at natural language processing analyses. The package incorporates four different lexicons:

  • Syuzhet (default)

  • Bing

  • AFINN

  • NRC

Its main goal is making NLP, and especially sentiment analysis of textual data, widely available in a simple and direct way. This particular kind of analysis reveals the emotional shifts that serve as proxies for the narrative movement between conflict and its resolution.2
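To make the mechanics concrete, here is a toy sketch of lexicon-based sentiment scoring, the general technique behind Syuzhet's get_sentiment(). The mini-lexicon and sentences below are invented for illustration; the actual package ships curated dictionaries (syuzhet, bing, afinn, nrc) with thousands of weighted entries.

```python
import re

# Invented mini-lexicon: each word maps to a positive or negative weight
TOY_LEXICON = {"happy": 0.8, "magic": 0.5, "friend": 0.6,
               "afraid": -0.6, "curse": -0.8, "dark": -0.4}

def sentence_sentiment(sentence: str) -> float:
    """Sum the lexicon values of every known word in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

story = ["The magic forest made her happy.",
         "But a dark curse left everyone afraid."]
# One value per sentence: the raw material of a sentiment trajectory
trajectory = [sentence_sentiment(s) for s in story]
print(trajectory)
```

Plotting such per-sentence values in narrative order is what produces the emotional arcs analysed later in this study.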



Web Scraping

After deciding on the time window of reference for the research, which spans from 1937 (the year Snow White and the Seven Dwarfs was released) to 2021 (the year this research started), we needed to gather all the relevant titles. The Wikipedia page listing Disney animated movies felt like the perfect place to start. We downloaded the HTML page using the requests module for Python and subsequently parsed the document tree using BeautifulSoup, an HTML and XML parsing library for Python.

from typing import List

from bs4 import BeautifulSoup
from bs4.element import PageElement
import json
import requests

# Wikipedia page for "Disney Movies"
DISNEY_URI = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Animation_Studios_films"

# Retrieve the webpage in HTML format
response = requests.get(DISNEY_URI)

if response.status_code == 200:
    # Save a local copy of the page for reference
    with open("disney_titles.html", "w") as fp:
        fp.write(response.text)
    # Turn html string into Soup object
    soup = BeautifulSoup(response.text, features="lxml")
    # retrieve all table rows containing movie titles
    rows: List[PageElement] = soup.find_all("tr")

    disney_movies = dict()
    # Find first td element (title) and second td element (year) and aggregate in dict object
    for row in rows:
        cells = row.find_all("td")
        if len(cells) < 2:  # skip header rows without data cells
            continue
        title = cells[0].text.strip("\n")
        date = cells[1].text.strip("\n").replace("\xa0", " ")  # drop non-breaking spaces
        disney_movies[title] = {"release_date": date}

    # Save dict as disney_titles.json
    with open("disney_titles.json", "w") as outfile:
        json.dump(disney_movies, outfile)

Once this step was over we had a JSON file mapping each movie title to its release year, in the format { "Snow White and the Seven Dwarfs": {"release_date": "1937"} }. To gather the subtitles for these movies we needed an open collection of subtitles, and we found OpenSubtitles’ service to fit our needs perfectly. They provide an open REST API, so after obtaining a key and getting comfortable with the documentation we quickly turned the JSON list of titles into a folder of .srt files. .srt files are very easy to work with, since they are plain text and their formatting is very predictable. Since at this stage we were working in Python, we decided to clean the raw subtitles using a Python library called pysrt, which proved essential for extracting textual data from the .srt files. Concurrently, we noticed that many texts were littered with HTML tags, descriptions of surroundings, advertisements and so on. While gathering the texts we therefore also started cleaning them. This is one of the functions used to remove unwanted textual data from our subtitles:
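As an illustration of how predictable the format is, the sketch below extracts dialogue from a fabricated .srt snippet with nothing but the standard library (the snippet and the helper function are ours, not part of pysrt):

```python
import re

# Fabricated .srt snippet for illustration; real files follow the same layout:
# a numeric counter, a timing line, then one or more lines of text.
SRT_SNIPPET = """\
1
00:00:01,000 --> 00:00:03,500
Once upon a time...

2
00:00:04,000 --> 00:00:06,000
<i>In a kingdom far away.</i>
"""

def extract_text(srt: str) -> str:
    """Keep only the dialogue lines of an .srt string."""
    lines = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue  # skip blanks, counters and timing lines
        lines.append(re.sub(r"<[^>]+>", "", line))  # drop formatting tags
    return " ".join(lines)

print(extract_text(SRT_SNIPPET))
# Once upon a time... In a kingdom far away.
```

pysrt does this parsing (and much more, such as timing arithmetic) for us, which is why we adopted it for the real pipeline.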

from typing import Dict
import os
import re

import pysrt

def parse_subs() -> Dict[str, list]:
    """
    Turn subtitle files into an object:
    {"movie_name": ["YEAR", "text"], ...}
    """
    subs_directory = "subs/"
    final_object = {}
    for file in os.listdir("./subs"):
        # Parse .srt file for easier handling
        try:
            srt = pysrt.open(subs_directory + file)
        except UnicodeDecodeError:
            print(f"Error handling file: {file}\nSkipping...")
            continue  # move on instead of reusing a stale `srt`

        # Remove opensubtitles ads and intro:
        opensubs_ads = r'(♪)|(Advertise your product or brand here)|(contact www\.OpenSubtitles\.(org|com) today)|(Support us and become VIP member)|(to remove all ads from www\.OpenSubtitles\.(org|com))|(-== \[ www\.OpenSubtitles\.(org|com) \] ==-)|((((Subtitles by )|(Sync by ))(.+))$)|(font color="(.+)?")|(Provided by(.+)$)|(^(https?):\/\/[^\s\/$.?#].[^\s]*$)|(Please rate this subtitle at (.)+$)|(Help other users to choose the best subtitles)'
        remove_ads = re.sub(re.compile(opensubs_ads), "", srt.text)
        # Remove html tags, dashes (dialogues), returns
        remove_curly = re.sub(re.compile(r"\{.*?\}"), "", remove_ads)
        remove_html = re.sub(re.compile(r"((<[^>]+>)+)"), " ", remove_curly)
        remove_html_closing = re.sub(re.compile(r"((<\/[^>]+>)+)"), " ", remove_html)
        remove_dashes = re.sub(re.compile(r"-\s"), " ", remove_html_closing)
        remove_returns = re.sub(re.compile(r"[\r\t\n]"), " ", remove_dashes)
        # Collapse whitespace left over from the removals above
        remove_double_spaces = re.sub(re.compile(r"(\s+)"), " ", remove_returns)
        remove_starting_spaces = re.sub(re.compile(r"(^\s)"), "", remove_double_spaces)
        # File names follow the pattern Movie_Title_YEAR.srt
        year = file.split("_")[-1].removesuffix(".srt")
        title = "_".join(file.split("_")[:-1])

        final_object[title] = [year, remove_starting_spaces]

    return final_object

Finally, we were done scraping and cleaning data. At this point the output of this first round was serialized with Python’s pickle module for future manipulation and saved as the first dataset.
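A minimal sketch of this serialization step follows; the file name disney_dataset.pkl and the sample entry are placeholders, not necessarily the names used in the project:

```python
import pickle

# Hypothetical output of the cleaning stage: {title: [year, text], ...}
dataset = {"Snow_White_and_the_Seven_Dwarfs": ["1937", "Once upon a time..."]}

# Serialize the dictionary to disk for later stages of the pipeline
with open("disney_dataset.pkl", "wb") as fp:
    pickle.dump(dataset, fp)

# Any later script can restore the exact same Python object
with open("disney_dataset.pkl", "rb") as fp:
    restored = pickle.load(fp)

assert restored == dataset
```

Pickle preserves the full Python object (nested lists, dictionaries, strings) without the flattening a CSV export would entail, which is why it suits an intermediate dataset like ours.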



2. Topic Modelling

Data Pre-processing

We started the topic modelling stage by running MALLET for the first time over the files resulting from the cleaning phase described in Section 1. We immediately noticed that the texts needed further cleaning, because MALLET’s stop-word dictionary and the pruning command had trouble reducing the presence of noisy words in the clusters we wanted to create, making the latter hard to understand. To resolve this first issue we created an additional Python script and fed it our subtitle files contained in the directory ./txts.

from nltk.tokenize import word_tokenize
import spacy
import os
import nltk
import re


NER = spacy.load("en_core_web_sm")
path = "nn_txts/"
for file in os.listdir("./txts"):
    with open("./txts/" + file, "r") as new_file:
        text = new_file.read()
        stripped_text = []
        parsed = NER(text)
        # Automatic detection of persons and organizations
        for word in parsed.ents:
            if word.label_ == "PERSON" or word.label_ == "ORG":
                text = text.replace(str(word), "")
        tokens = word_tokenize(text)
        tagged = nltk.pos_tag(tokens)
        for word, tag in tagged:
            # Keep only singular common nouns longer than four characters
            if tag == 'NN' and len(word) > 4:
                stripped_text.append(word)
We imported the spacy, nltk and re libraries to perform NLP tasks such as PoS tagging and Named Entity Recognition. Their methods were used to clean the characters’ dialogues of non-lexical phenomena, names and stop-words. For this purpose we first used spacy’s en_core_web_sm model to parse the texts and recognize characters’ names, labelled "PERSON" or "ORG" by the parser; nltk’s PoS tagger was then used to keep tokens tagged "NN" (i.e., singular English nouns) longer than four characters and filter out all other words.

After this step the dialogues were fed back to MALLET, which produced new clusters of meaningful words that made sense with one another to human readers.

Finally, clusters were improved by removing words recognized as nouns but containing symbols (in our particular case we had issues with apostrophes), in order to avoid redundancies, inside and among clusters, caused by different forms of the same word.

The resulting files were saved into directory cartoonlp/nn_txts.

        # Remove words with apostrophes, such as contracted pronouns;
        # filtering into a new list avoids skipping items while iterating
        stripped_text = [w for w in stripped_text
                         if not re.search(r"\w+[']\w+?", w)]

        new_string = " ".join(str(x) for x in stripped_text)

        with open(path + file, "w") as out_file:
            out_file.write(new_string)


Text Mining

Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights.

IBM

Topic modelling is an unsupervised text mining method that we decided to perform along with sentiment analysis. We combined both techniques not only to discover which topics characters talk about the most, but also to understand whether they talk positively or negatively about those topics.3

First we imported the pre-processed files from the directory /nn_txts into MALLET and removed any remaining English stop words detected with the --remove-stopwords option.

mallet import-dir \
  --input sample-data/nn_txts \
  --output disney_topics.mallet \
  --keep-sequence \
  --remove-stopwords


In the exploratory phase we ran MALLET multiple times over the corpus, modifying parameters from run to run to account for the peculiarities of our corpus, such as its limited size (i.e., 1.9 MB for 59 files) and the target-specificity of the language used in children’s movies. Topics thus had to be appealing to, or at least related to, kids’ everyday life and likes (e.g., one topic could concern school, family, games, emotions or different genres of tales).

We found the following input parameters useful for train-topics:

  • --num-topics: the number of topics to create.

    Considering the above-mentioned characteristics of the Disney corpus, we set 5 as a starting value and increased it up to 15, at which point we decided the clusters were satisfying: each cluster was more intelligible and homogeneous, and its words made sense with the movies they were assigned to.

  • --optimize-burn-in: the number of iterations before hyper-parameter optimization begins. Default is twice the optimize interval.4 It was raised to 60 (the default, twice the interval, would have been 40) since we noticed it helped with the topics’ homogenization.


The number of iterations and the optimization interval were kept at their default values, since those worked for us:

  • --num-iterations 1000

  • --optimize-interval 20


The final input parameters we gave MALLET are the following; after that run we decided to keep the resulting clusters.

mallet train-topics \
  --input disney_topics.mallet \
  --num-iterations 1000 \
  --optimize-interval 20 \
  --num-topics 15 \
  --optimize-burn-in 60 \
  --output-state disney-topic-state.gz \
  --output-topic-keys disney_keys.csv \
  --output-doc-topics disney_composition.csv \
  --xml-topic-report disney_report.xml


Topics analysis

Once the topics were ready, we imported the resulting .csv files, mallet_keys.csv and mallet_values.csv, into RStudio. Release dates were also added to the latter in order to sort the movies chronologically for analysis purposes.

Head of the first data frame extracted from our analysis


We proceeded by plotting a stacked bar chart of the mallet_values data frame. This kind of visualization helped us understand the distribution of topics over time and in which movies each topic was present and visible.

By looking at the chart below we can already distinguish, at this early stage of our analysis, three distinct ways in which topics are distributed over time:

  1. Topics spread across the whole time range considered (e.g., T5 and T8)

  2. Topics concentrated in one or more specific time spans (e.g., T10 and T13)

  3. Rare topics, only appearing once in a while (e.g., T7)


In order to get more familiar with the data we gathered, we looked at the count of movies for each topic, to see how movies are distributed among topics. To do so we plotted a line chart for each topic.

We noticed that every topic has a few movies in which its weight value is above 0.20, and selected this number as the minimum threshold for deciding which movie to include in which cluster. This threshold was employed in the per-topic selection of movies tabulated below in the matrix data frame.

Matrix data frame


Plot bar of the number of movies per topic
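The thresholding step described above can be sketched as follows; the titles and weights here are invented for illustration, while the real values come from MALLET's doc-topics output:

```python
# Sketch of the 0.20-weight threshold used to assign movies to clusters.
# Invented per-movie topic weights, shaped like MALLET's doc-topics table.
THRESHOLD = 0.20

doc_topics = {
    "Tangled":         {"T1": 0.05, "T2": 0.48, "T5": 0.12},
    "Peter_Pan":       {"T1": 0.02, "T2": 0.10, "T5": 0.31},
    "Treasure_Planet": {"T1": 0.22, "T2": 0.03, "T5": 0.09},
}

def movies_in_topic(topic: str) -> list:
    """Return the movies whose weight for `topic` exceeds the threshold."""
    return sorted(title for title, weights in doc_topics.items()
                  if weights.get(topic, 0.0) > THRESHOLD)

print(movies_in_topic("T2"))  # ['Tangled']
print(movies_in_topic("T5"))  # ['Peter_Pan']
```

The same movie can of course exceed the threshold for several topics at once, which is why some titles appear in more than one cluster below.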


The following script shows how we built the tables for each cluster’s movies:

movies <- c()
dates <- c()
t_weight <- c()
for (i in rownames(matrix)) {
  title <- matrix[i, "movie"]
  date <- matrix[i, "date"]
  row <- mallet_values[match(title, mallet_values$Title) + 1, ]
  w <- row$T1
  if (matrix[i, "Topic1"] == 1) {
    movies <- c(movies, title)
    dates <- c(dates, date)
    t_weight <- c(t_weight, w)
  }
}

cluster_T1 <- data.frame(movies, dates, t_weight)


Movies in T1

thing friend medal jungle wheel video arcade stuff track racer glitch building today virus man-village buddy inurity march credit princess


Movies in T2

dream world birthday child kingdom tower magic stone sword power witch crown blood flower story miracle tomorrow gleam light castle

Movies in T3

captain treasure diamond emperor pirate llama world order silver leader shadow flight singing house woman cyborg career chief shirt cliff


Movies in T4

money street sheriff woman mouth kitty uncle range horse carpet reward partner trail alley sultan outta church minute property permission


Movies in T5

thing heart family father chance river sister moment truth point daughter fault spirit death danger strength question choice sword place


Movies in T6

thing honey fellow goodness friend doctor house moment tummy narrator brain queen stuff mouse thought prize sense chapter bother bottle


Movies in T7

bunny savage world father conscience actor school plenty predator otter crime officer chain number couple whale traffic alert police system


Movies in T8

place night mother minute morning friend thing matter trouble house surprise hurry business earth tonight creature goodness charge today devil


Movies in T9

heart water brother island village mountain world ocean voice monster share story earth stuff mission board chicken journey darkness ground


Movies in T10

gaucho plane circus angel motion elephant climax planet samba potato peanut shelter knife picture lilongo stand saddle roller stitch feather


Movies in T11

future today machine family science school buddy garage chance cover story class question invention baseball robot project problem companion control


Movies in T12

dream bridge grandfather crystal adventure power round excitement motorcar paper schoolmaster source price court country mania language flight police decision


Movies in T13

music heart story hurry dress spring number stuff dream window sound slipper beauty romance picture country matter tonight sweet glass


Movies in T14

beast master castle monster father world watch party rabbit trouble afternoon child apple chance dinner pardon fault spell advice guest


Movies in T15

prince world princess magic water voice dragon future night palace forest today daughter sense problem light thing reason bayou restaurant


Findings

This analysis proved quite interesting with regard to some topics, but some of the extracted movie clusters are not particularly informative. Clusters like T8 mostly represent the vast prevalence of oral language and do not mirror a specific register or theme, which also explains their widespread presence in the given dataset.

On the contrary, some of the clusters proved vital to understanding our dataset and to deciding which clusters and movies could be worthy of additional analyses. Targeted clusters were gathered and earmarked for a dedicated sentiment analysis.

Specifically, we found:

 • T2 looks like general tale-related language, and consists mainly of classic well-known movies which share a great deal of language. Many of these were released around the 1960s, but it was indeed interesting to see a more recent picture (Tangled, 2010) included in the cluster. 
This was one of the clusters selected for a targeted sentiment analysis because of this similarity between chronologically distant movies.

 • T3 mainly consists of adventure-related language, and is therefore present in movies featuring this theme prominently. An additional analysis could shed some light on the differences or similarities between movies from 1953, 1990 and 2002, which span many years.

 • T5 appears to be family-related language, and seems prominent in many movies. An interesting feature of this table is that there seems to be a correlation with the gender of the lead: movies with a strong male lead show less topic-related language, while in those with a female lead the weight is generally higher. This observation warrants additional analysis.

 • T8 is the cluster that includes the greatest number of movies among those extracted. While we initially thought this could be interesting, we quickly realised that this behaviour is due to the fact that the topic is not a real “topic” but a collection of fable- and story-related language. For this reason we decided to discard the cluster.

 • T9 seems to deal with nature and wandering. From the movies included, it looks like the presence of this topic increased from the end of the twentieth century, so we decided to analyse these movies from a sentiment point of view.

 • T11 could be technology- and future-related language; it mainly appears from the 2000s onwards, and it could be interesting to see whether these movies lean positive or negative.

 • T13 is another tale or story-related language cluster, but this time we noticed all movies have a prominent soundtrack and fewer dialogues. A sentiment analysis of four of these music-centric movies could be interesting for our research.

 • T14 consists of magical or fantastical words. The included movies seem to warrant this idea, but we were interested in an additional sentiment analysis to understand the different ways this theme is portrayed in 1951 and 1991.

 • T15, finally, proved to be an interesting cluster, as it looked like it could include many strong female leads. We decided to pursue this train of thought and included in additional research all movies led by women, to understand the difference in perception from 2000 to today.



Sentiment analysis


The data we needed to start the sentiment analysis was contained in the file 03_out_dataframe.csv, created with the script 03_nltk_processing.py in our GitHub repository.

library(syuzhet) #enables Syuzhet package for the sentiment analysis

df <- read.csv(url("https://raw.githubusercontent.com/fcagnola/cartoonlp/main/03_out_dataframe.csv"))

paged_table(head(df))


For this kind of analysis we extracted the dimensions of df we were interested in: the release year, the title, and the text as output by the first cleaning pass described in Section 1.

We chose these three parameters to position each movie chronologically as a distinct item, leaving all tokenization tasks to be carried out by Syuzhet.

texts_df <- df[, c("X", "Year", "Text")]
texts_df <- texts_df %>% rename(Title = X)
texts_df <- texts_df %>% arrange(Year)
# Record each movie's length as a word count
for (i in rownames(texts_df)) {
  string <- texts_df[i, "Text"]
  count <- lengths(gregexpr("\\W+", string)) + 1
  texts_df[i, "Length"] = count
}

An example of the final data frame is illustrated here


Experiments

The following script shows how we created and manipulated the syuzhet vectors for each cluster’s movies.

For each one of the following experiments, the first step was to extract the dialogues from texts_df, split them into sentences with the get_sentences() function, and create a vector of sentiment values computed with the "syuzhet" method, which we chose because it is tuned for fiction.

text1 <- "Sleeping_Beauty"
row_1 <- texts_df[match(text1, texts_df$Title), ]
string_1 <- row_1$Text

text2 <- "The_Sword_in_the_Stone"
row_2 <- texts_df[match(text2, texts_df$Title), ]
string_2 <- row_2$Text

text3 <- "Tangled"
row_3 <- texts_df[match(text3, texts_df$Title), ]
string_3 <- row_3$Text

v_1 <- get_sentences(string_1)
v_2 <- get_sentences(string_2)
v_3 <- get_sentences(string_3)

sv_1 <- get_sentiment(v_1, method="syuzhet")
sv_2 <- get_sentiment(v_2, method="syuzhet")
sv_3 <- get_sentiment(v_3, method="syuzhet")


In order to make movies of different word-lengths comparable, we followed the instructions provided by the Syuzhet developers for normalizing the vectors and visualizing them together on a line chart, so as to see differences in their sentiment shifts.

The first step consisted in smoothing and re-scaling the syuzhet vectors with the loess() and rescale() functions respectively. At this point the vectors still had different lengths and could not be compared mathematically. We therefore applied the same formula used in the tutorial, down-sampling each vector to 100 points and plotting them on the graph.5

#normalization for comparison

x1 <- 1:length(sv_1)
y1 <- sv_1
raw_1 <- loess(y1 ~ x1, span=.5)
line1 <- rescale(predict(raw_1))
x2 <- 1:length(sv_2)
y2 <- sv_2
raw_2 <- loess(y2 ~ x2, span=.5)
line2 <- rescale(predict(raw_2))
x3 <- 1:length(sv_3)
y3 <- sv_3
raw_3 <- loess(y3 ~ x3, span=.5)
line3 <- rescale(predict(raw_3))

sample_1 <- seq(1, length(line1), by=round(length(line1)/100))
sample_2 <- seq(1, length(line2), by=round(length(line2)/100))
sample_3 <- seq(1, length(line3), by=round(length(line3)/100))
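For readers without R, the same smooth-rescale-resample pipeline can be sketched in plain Python; a moving average stands in for loess(), which has no standard-library equivalent, and the input values below are invented:

```python
# Sketch of the normalization used to compare sentiment arcs of movies of
# different lengths: smooth, rescale to [0, 1], then sample 100 points.

def moving_average(values, window=5):
    """Simple smoothing; each point becomes the mean of its neighbourhood."""
    half = window // 2
    return [sum(values[max(0, i - half):i + half + 1]) /
            len(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def rescale(values):
    """Map values linearly onto the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sample_points(values, n=100):
    """Pick n evenly spaced points so arcs of any length are comparable."""
    step = max(1, round(len(values) / n))
    return values[::step]

raw = [0.5, -1.0, 0.0, 2.0, 1.5, -0.5, 0.0, 1.0] * 50  # fake sentence scores
arc = sample_points(rescale(moving_average(raw)), n=100)
print(len(arc))  # 100
```

Once every movie is reduced to the same number of points on the same scale, the arcs can be overlaid on a single chart, which is exactly what the experiments below do.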


Experiment on T2

Since we chose to plot dialogues in which T2 had a weight >= .40, we noticed that the lines of the two movies released in the mid-twentieth century have completely different shapes, suggesting there is no binding correlation between the selected movies and their sentiment trends. The graph becomes interesting when we also look at Tangled’s curve (2010), which has the peculiarity of looking like a historical average of the two earlier movies.

This particular graph suggests that there is no fixed trend in the sentiment curves of movies sharing the same topic, but that some combinations of sentiment shifts might be characteristic of a given cluster.


Experiment on T3

In cluster T3 we selected Peter Pan from 1953, The Rescuers from 1977 together with its sequel released in 1990, and finally a movie from the early 2000s, Treasure Planet.

Comparing the two movies at the opposite poles of the timeline considered here, we notice that Peter Pan’s curve is more excited and swinging than Treasure Planet’s. The first Rescuers movie, instead, looks like a pivot between the two, suggesting that over the years the plots of movies clustered in topic 3 might have been smoothed in terms of sentiment shifts.

Finally, we also plotted the sequel to The Rescuers, released 13 years after the original. The comparison between these two movies is interesting because they start in two different ways (the first very positively, the second extremely negatively) but share the time positions of local maxima and minima on the graph.


Experiment on T5

For topic 5 we decided to focus on movies released within the same decade, the nineties. Here we notice a distinction in sentiment values between movies with male leads and those with female leading characters. Tarzan and Hercules indeed have the characteristic of ending with highly positive interactions and dialogues compared to Mulan and Pocahontas. Additionally, we noted that both movies with female leads finish with a value 0.5 points lower than at the beginning.


Experiment on T9

In the analysis of topic 9 we selected the movies released between the end of the nineties and the beginning of the two-thousands, including Moana, which has the highest weight value in this cluster. By means of this graph we found out that the movies released after the year 2000 tend to end with negative sentiment values, while Tarzan and Dinosaur end with highly positive sentiment in their dialogues.

Additionally, the graph shows the overall similarity of the curves of the first three movies, which are close to each other in time (1 to 4 years apart), and how the pattern evolved 13 years later with Moana. This last movie indeed has a more swinging curve: it is the only one with values above 0 in the middle of the plot (the others are positive only at the beginning and/or the end of the story), and its curve decreases where the others are increasing.


Experiment on T11

T11 was selected for analysis because we thought it would be worthwhile to compare the emotional valence of two movies having technology as their main topic, released almost a decade apart: Meet The Robinsons (2007) and Big Hero 6 (2014).

Here we noticed that the first movie is the one with the most swinging curve. It starts in a neutral manner, decreasing in emotional valence for the first quarter of the film, while the second movie’s curve acts in the opposite way (from very negative, -1, to positive values). By the beginning of the second quarter of the x axis, both movies start decreasing again; but while Big Hero 6 settles around an average value of -0.5 for most of the remaining narrative time, Meet The Robinsons keeps decreasing until the finale, in which the emotional valence of its dialogues increases up to 1.

We suppose these values could be somehow indicative of the changes in the approach towards technology.


Experiment on T13

For cluster 13 we analyzed three movies released between 1942 and 1950. Here we notice that the sentiment valence, at least at the beginning, is quite similar between Bambi and Cinderella, and never exceeds 0 until the finale. On the other side, the dialogues of Make Mine Music have a similar valence in the first half of the movie, but change drastically at its midpoint, increasing in positivity and reaching a value of 1 around point 70 of its narrative-time sample.

This peculiarity of Make Mine Music with respect to Bambi and Cinderella might be explained by its structure (composed of 10 short films), which does not undergo the same constraints as the plot of a traditional story like the other two movies, but is representative of, and influenced by, the deliberate editing and artistic choices made by its producers.


Experiment on T14

T14’s movies Alice in Wonderland and Beauty and the Beast were released 40 years apart from each other, and we decided to compare the two in order to understand whether some similarities remained in the dialogues’ emotional valence despite their distance in time.

As we can see from the plotted graph, we have an interesting situation: the older film starts and ends with values below 0 and reaches its maximum positivity in dialogues exactly in the middle of the narrative time, while the 1991 movie has a more complex curve in terms of emotional shifts and starts in a positive way.


Experiment on T15

The movies in cluster 15 all have similar curves and, as time passes, we notice an increase in positive emotional valence, especially in the central part of the narrative time. This increase in positive values could be related to the growing importance of the topic in the movies, since it follows the same trend as the weights:

  • Princess and the Frog has a T15 weight of 0.4825248 and a mean emotional valence of 0.1783544

  • Frozen has a T15 weight of 0.3435776 and a mean emotional valence of 0.1445293

  • Raya and the Last Dragon has a T15 weight of 0.5064836 and a mean emotional valence of 0.2095595

Finally, we note that, despite having different curves, all movies end with roughly the same trend and sentiment value.



Conclusions


While we gathered many valuable insights, we are also aware of the limitations of our study, mainly due to our limited knowledge and skill in the field of data science as well as to our small dataset.

Some interesting conclusions emerged nonetheless, mainly from the intersection between extracted data and the historical, social and cultural background of the students involved in the project. This humanities background, together with our passion for digital methodologies, allowed us to notice differences in movies with male or female leads, or to better understand and explain seemingly random differences that were in fact due to the socio-cultural customs of the times.

We hope this introductory analysis can prove to be a useful starting point for further explorations of this topic. We strongly feel that this field - natural language aimed at children - though rarely explored, can be quite interesting because of its dual nature. In a seemingly transparent way, since we are dealing with a living natural language, it mirrors some cultural features directly; at the same time it also contributes to the formation of young citizens, and is thus involved in shaping tomorrow’s society.




Web Resources

https://mimno.github.io/mallet/index

https://senderle.github.io/topic-modeling-tool/documentation/2018/09/27/optional-settings.html

https://github.com/mjockers/syuzhet

https://www.qualtrics.com/uk/experience-management/research/text-analysis/?rid=ip&prevsite=en&newsite=uk&geo=IT&geomatch=uk

https://www.ibm.com/cloud/learn/text-mining


  1. cit. https://mimno.github.io/MALLET/index↩︎

  2. cf. https://github.com/mjockers/syuzhet↩︎

  3. cf. https://www.qualtrics.com/uk/experience-management/research/text-analysis/?rid=ip&prevsite=en&newsite=uk&geo=IT&geomatch=uk↩︎

  4. cit. https://mimno.github.io/MALLET/topics↩︎

  5. cf. https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html↩︎